Simulating

September 8 - 17, 2025

Jo Hardin

Agenda 9/8/25

  1. Simulating to understand bias-variance trade-off
  2. map()

Bias-Variance Trade-off

What is variance? What is bias?

Variance refers to the amount by which \(\hat{f}\) would change if we estimated it using a different training set.

Bias refers to the error that is introduced by approximating the “truth” by a model which is too simple.

Mathematically

We can measure how well a model does by looking at how close the fit of the model (\(\hat{f}\)) is to the observed data (\(y\)).

\[MSE = E\bigg[(Y - \hat{f}(x))^2\bigg]\]

Because \(Y = f(x) + \epsilon\), we know that:

\[\begin{eqnarray*} MSE &=& E\bigg[(Y - \hat{f}(x))^2\bigg]\\ &=& E\bigg[(f(x) + \epsilon - \hat{f}(x))^2\bigg]\\ &=& E\bigg[(f(x) - \hat{f}(x))^2\bigg] + 2 E\bigg[(f(x) - \hat{f}(x)) \cdot \epsilon \bigg] + E[\epsilon^2] \end{eqnarray*}\]

Term 3: The third term is \(\sigma^2\), the irreducible error (the part that even the perfect model can’t predict).

Term 2: Turns out to be zero because the errors are assumed to be independent of the explanatory variable (and hence of the fitted model), so the expectation factors into a product with \(E[\epsilon] = 0\).
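In symbols: because \(\epsilon\) is independent of \(f(x) - \hat{f}(x)\) and \(E[\epsilon] = 0\),

\[E\bigg[(f(x) - \hat{f}(x)) \cdot \epsilon \bigg] = E\bigg[f(x) - \hat{f}(x)\bigg] \cdot E[\epsilon] = E\bigg[f(x) - \hat{f}(x)\bigg] \cdot 0 = 0\]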

Term 1: Takes a little bit of work to expand (something you’d do in, say, Math 152).

Simplified Term 1

\[E\bigg[(f(x) - \hat{f}(x))^2\bigg] = \bigg( f(x) - E[\hat{f}(x)] \bigg)^2 + E\bigg[ (E[\hat{f}(x)] - \hat{f}(x))^2 \bigg]\]

Where:

\(\bigg[ \mbox{Bias of } \hat{f}(x) \bigg]^2\) is \(\bigg( f(x) - E[\hat{f}(x)] \bigg)^2\)

Variance of \(\hat{f}(x)\) is \(E\bigg[ (E[\hat{f}(x)] - \hat{f}(x))^2 \bigg]\)
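The expansion follows from adding and subtracting \(E[\hat{f}(x)]\) inside the square; the cross term vanishes because \(E\big[E[\hat{f}(x)] - \hat{f}(x)\big] = 0\):

\[\begin{eqnarray*} E\bigg[(f(x) - \hat{f}(x))^2\bigg] &=& E\bigg[\Big( \big(f(x) - E[\hat{f}(x)]\big) + \big(E[\hat{f}(x)] - \hat{f}(x)\big) \Big)^2\bigg]\\ &=& \bigg( f(x) - E[\hat{f}(x)] \bigg)^2 + 2\bigg( f(x) - E[\hat{f}(x)] \bigg) E\bigg[E[\hat{f}(x)] - \hat{f}(x)\bigg] + E\bigg[ (E[\hat{f}(x)] - \hat{f}(x))^2 \bigg]\\ &=& \bigg( f(x) - E[\hat{f}(x)] \bigg)^2 + E\bigg[ (E[\hat{f}(x)] - \hat{f}(x))^2 \bigg] \end{eqnarray*}\]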

MSE

\[\begin{eqnarray*} MSE &=& E\bigg[(Y - \hat{f}(x))^2\bigg]\\ &=& \bigg[ \mbox{ Bias of } \hat{f}(x) \bigg]^2 + \mbox{ Variance of } \hat{f}(x) + \sigma^2 \end{eqnarray*}\]
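A quick Monte Carlo check of this decomposition, using a deliberately biased estimator (a shrunken sample mean; the shrinkage 0.8 and the sample sizes here are illustrative choices, not from the slides):

```r
set.seed(47)
f_x   <- 3        # the true f(x)
sigma <- 2        # sd of the irreducible error
reps  <- 100000

# each rep: estimate f(x) from a fresh training sample with a biased
# estimator (0.8 * sample mean), then observe a new Y = f(x) + eps
f_hat <- replicate(reps, 0.8 * mean(rnorm(10, mean = f_x, sd = sigma)))
y_new <- rnorm(reps, mean = f_x, sd = sigma)

mse      <- mean((y_new - f_hat)^2)
bias_sq  <- (f_x - mean(f_hat))^2
variance <- var(f_hat)

c(MSE = mse, decomposition = bias_sq + variance + sigma^2)  # should agree
```

The two numbers agree up to Monte Carlo error, with bias\(^2 \approx 0.36\), variance \(\approx 0.256\), and \(\sigma^2 = 4\) for this estimator.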

Simulation

Let’s consider a model: \(Y = f(x) + \epsilon\)

num_x <- length(seq(0, 10, by = 0.5))
set.seed(4774)
bias_var_data <- tibble(ex = seq(0, 10, by = 0.5),
                        eps = rnorm(num_x, mean = 0, sd = 5),
                        why = ex^2 + eps) |> 
  mutate(underfit = lm(why ~ ex)$fitted,
         `good fit` = lm(why ~ ex + I(ex^2))$fitted,
         overfit = lm(why ~ ex + I(ex^2) + I(ex^3) + I(ex^4) + I(ex^5) + I(ex^6) + I(ex^7) + I(ex^8) + I(ex^9) + I(ex^10) )$fitted) |> 
  pivot_longer(cols = underfit:overfit, 
               names_to = "fit", 
               values_to = "prediction") |> 
  mutate(fit = factor(fit, 
                        levels = c("underfit", "good fit", "overfit") ))

bias_var_data |> 
  ggplot(aes(x = ex, y = why)) + 
  geom_point() + 
  geom_line(aes(y = prediction)) + 
  facet_grid(~ fit)

Errors

Which is worse? Underfitting or overfitting? How can we measure the error?

bias_var_data |> 
  group_by(fit) |> 
  summarize(MSE_OG = mean((why - prediction)^2))
# A tibble: 3 × 2
  fit      MSE_OG
  <fct>     <dbl>
1 underfit   74.8
2 good fit   21.9
3 overfit    10.5

Fit to new data

WAIT! We don’t want the error on the data that built the model, we want the error on the population.

set.seed(470)
new_data <- tibble(ex = seq(0, 10, by = 0.5),
                   eps = rnorm(num_x, mean = 0, sd = 5),
                   why = ex^2 + eps)

bias_var_data |> 
  ggplot(aes(x = ex, y = why)) + 
  geom_point(data = new_data, aes(x = ex, y = why)) + 
  geom_line(aes(y = prediction)) + 
  facet_grid(~ fit)
lm1 <- bias_var_data |> 
  filter(fit == "underfit") |> 
  lm(why ~ ex, data = _)

lm2 <- bias_var_data |> 
  filter(fit == "good fit") |> 
  lm(why ~ ex + I(ex^2), data = _)

lm3 <- bias_var_data |> 
  filter(fit == "overfit") |> 
  lm(why ~ ex + I(ex^2) + I(ex^3) + I(ex^4) + I(ex^5) + 
       I(ex^6) + I(ex^7) + I(ex^8) + I(ex^9) + I(ex^10), data = _)

new_data |> 
  mutate(underfit = predict(lm1, newdata = new_data),
         `good fit` = predict(lm2, newdata = new_data),
         overfit = predict(lm3, newdata = new_data)) |> 
  pivot_longer(cols = underfit:overfit, 
               names_to = "fit", 
               values_to = "prediction") |> 
  mutate(fit = factor(fit, 
                        levels = c("underfit", "good fit", "overfit") )) |> 
  group_by(fit) |> 
  summarize(MSE_newdata = mean((why - prediction)^2))
# A tibble: 3 × 2
  fit      MSE_newdata
  <fct>          <dbl>
1 underfit        98.6
2 good fit        27.3
3 overfit         35.8

Back to bias and variance

set.seed(47)
datasets <- 1:9 |> 
  map_dfr(\(i) {
    tibble(
      ex  = seq(0, 10, by = 0.5),
      eps = rnorm(num_x, mean = 0, sd = 5),
      why = ex^2 + eps,
      dataset = i
    )
  }) |> 
  group_by(dataset) |> 
  mutate(underfit = lm(why ~ ex)$fitted,
         `good fit` = lm(why ~ ex + I(ex^2))$fitted,
         overfit = lm(why ~ ex + I(ex^2) + I(ex^3) + I(ex^4) + I(ex^5) + I(ex^6) + I(ex^7) + I(ex^8) + I(ex^9) + I(ex^10) )$fitted) |> 
  pivot_longer(cols = underfit:overfit, 
               names_to = "fit", 
               values_to = "prediction") |> 
  mutate(fit = factor(fit, 
                        levels = c("underfit", "good fit", "overfit") ))

datasets
datasets |> 
  filter(fit == "underfit") |> 
  ggplot(aes(x = ex, y = why)) + 
  geom_point(size = 0.25) + 
  geom_line(aes(y = prediction)) + 
  facet_wrap(~ dataset)
datasets |> 
  filter(fit == "good fit") |> 
  ggplot(aes(x = ex, y = why)) + 
  geom_point(size = 0.25) + 
  geom_line(aes(y = prediction)) + 
  facet_wrap(~ dataset)
datasets |> 
  filter(fit == "overfit") |> 
  ggplot(aes(x = ex, y = why)) + 
  geom_point(size = 0.25) + 
  geom_line(aes(y = prediction)) + 
  facet_wrap(~ dataset)

What we know about bias-variance trade-off

  • The underfit model looks the same for every dataset! That is, the model has low variance.

  • The underfit model doesn’t fit the data very well. The model has high bias.

  • The overfit model is very flexible, and looks like it fits the data very well. It has low bias.

  • The fitted models look quite different across the different datasets. The overfit model has high variance.
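The bullets above can be quantified: average the squared bias and the variance of the fitted values across many fresh training sets. A minimal sketch (the 200 datasets and the `poly()` shorthand for the degree-10 fit are illustrative choices):

```r
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)

set.seed(47)
xs <- seq(0, 10, by = 0.5)

# fit all three models to each of 200 fresh training sets
preds <- map_dfr(1:200, function(i) {
  why <- xs^2 + rnorm(length(xs), mean = 0, sd = 5)
  tibble(ex = xs,
         underfit   = lm(why ~ xs)$fitted,
         `good fit` = lm(why ~ xs + I(xs^2))$fitted,
         overfit    = lm(why ~ poly(xs, 10))$fitted)
})

bias_var <- preds |>
  pivot_longer(-ex, names_to = "fit", values_to = "pred") |>
  group_by(fit, ex) |>
  summarize(bias2 = (mean(pred) - ex^2)^2,   # squared bias at each x
            var   = var(pred),               # variance at each x
            .groups = "drop") |>
  group_by(fit) |>
  summarize(mean_bias2 = mean(bias2), mean_var = mean(var))

bias_var   # underfit: big bias, small variance; overfit: the reverse
```

The table makes the trade-off numeric: the underfit model has by far the largest average squared bias, and the overfit model by far the largest average variance.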

Goals of Simulating Complicated Models

The goal of simulating a complicated model is not only to create a program which will provide the desired results. We also hope to be able to write code such that:

  1. The problem is broken down into small pieces.
  2. The code has checks in it to see what works (run the lines inside the functions!).
  3. The code is simple (as often as possible).

Aside: a few R functions (ifelse())

data.frame(value = c(-2:2)) |>
  mutate(abs_value = ifelse(value >=0, value, -value))  # abs val
  value abs_value
1    -2         2
2    -1         1
3     0         0
4     1         1
5     2         2

Aside: a few R functions (case_when())

set.seed(4747)
diamonds |> select(carat, cut, color, price) |>
  sample_n(20) |>
  mutate(price_cat = case_when(
    price > 10000 ~ "expensive",
    price > 1500 ~ "medium", 
    TRUE ~ "inexpensive"))
# A tibble: 20 × 5
   carat cut       color price price_cat  
   <dbl> <ord>     <ord> <int> <chr>      
 1  1.23 Very Good F     10276 expensive  
 2  0.35 Premium   H       706 inexpensive
 3  0.7  Good      E      2782 medium     
 4  0.4  Ideal     D      1637 medium     
 5  0.53 Ideal     G      1255 inexpensive
 6  2.22 Ideal     G     14637 expensive  
 7  0.3  Ideal     G       878 inexpensive
 8  1.05 Ideal     H      4223 medium     
 9  0.53 Premium   E      1654 medium     
10  1.7  Ideal     H      7068 medium     
11  0.31 Good      E       698 inexpensive
12  0.31 Ideal     F       840 inexpensive
13  1.03 Ideal     H      4900 medium     
14  0.31 Premium   G       698 inexpensive
15  1.56 Premium   G      8858 medium     
16  1.71 Premium   G     11032 expensive  
17  1    Good      E      5345 medium     
18  1.86 Ideal     J     10312 expensive  
19  1.08 Very Good E      3726 medium     
20  0.31 Premium   E       698 inexpensive

Aside: a few R functions (sample())

sampling, shuffling, and resampling: sample()

set.seed(47)
alph <- letters[1:10]
alph
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
sample(alph, 5, replace = FALSE) # sample (from a population)
[1] "i" "b" "g" "d" "a"
sample(alph, 5, replace = TRUE) # sample with replacement
[1] "j" "g" "f" "i" "f"
sample(alph, 10, replace = FALSE)  # shuffle
 [1] "f" "h" "i" "e" "g" "d" "c" "j" "b" "a"
sample(alph, 10, replace = TRUE)  # resample
 [1] "e" "j" "e" "b" "e" "c" "f" "a" "e" "a"

Aside: a few R functions (set.seed())

What if we want to be able to generate the same random numbers (here on the interval from 0 to 1) over and over?

set.seed(4747)
runif(4, 0, 1) # random uniform numbers
[1] 0.1949071 0.3390270 0.5147919 0.4516470
set.seed(123)
runif(4, 0, 1) # random uniform numbers
[1] 0.2875775 0.7883051 0.4089769 0.8830174
set.seed(4747)
runif(4, 0, 1) # random uniform numbers
[1] 0.1949071 0.3390270 0.5147919 0.4516470

Why?

Three simulation methods are used for different purposes:

  1. Monte Carlo methods - use repeated sampling from a population with known characteristics.

  2. Randomization / Permutation methods - use shuffling (sampling without replacement from a sample) to test hypotheses of “no effect”.

  3. Bootstrap methods - use resampling (sampling with replacement from a sample) to establish confidence intervals.
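The three methods differ only in what is sampled and whether replacement is used. A minimal sketch with a made-up toy sample:

```r
set.seed(47)
x <- c(2, 4, 4, 5, 7, 8)   # a small observed sample (made up for illustration)

# 1. Monte Carlo: draw from a known population (standard normal here)
monte_carlo <- rnorm(length(x))

# 2. Permutation: shuffle the sample -- sampling WITHOUT replacement
shuffled <- sample(x, length(x), replace = FALSE)

# 3. Bootstrap: resample the sample -- sampling WITH replacement
resampled <- sample(x, length(x), replace = TRUE)

sort(shuffled)   # a shuffle keeps exactly the original values
```

A shuffle always returns the original values in a new order, while a bootstrap resample typically repeats some values and drops others.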

Monte Carlo from a population to…

  • …estimate probabilities more easily than with calculus.
  • …understand complicated models.
  • …consider statistical methods with fewer assumptions.

Small simulation example

Sally & Joan

Consider a situation where Sally and Joan plan to meet to study in their college campus center (Mosteller 1987; Baumer, Kaplan, and Horton 2021). They are both impatient people who will wait only 10 minutes for the other before leaving.

But their planning was incomplete. Sally said, “Meet me between 7 and 8 tonight at the student center.” When should Joan plan to arrive at the campus center? And what is the probability that they actually meet?

Simulate their meeting

Assume that Sally and Joan are both equally likely to arrive at the campus center anywhere between 7pm and 8pm.

library(tictoc)
n <- 10000

tic()
meet_vect <- tibble(
  sally = runif(n, 0, max = 60),
  joan = runif(n, min = 0, max = 60),
  result = ifelse(
    abs(sally - joan) <= 10, "They meet", "They do not"
  )
)
toc()
0.002 sec elapsed
meet_func <- function(nada){
  tibble(
    sally = runif(1, min = 0, max = 60),
    joan = runif(1, min = 0, max = 60),
    result = ifelse(
    abs(sally - joan) <= 10, "They meet", "They do not"
  )
  )
}
tic()
meet_map <- 1:n |> 
  map(meet_func) |> 
  list_rbind()
toc()
2.273 sec elapsed
tic()
meet_for <- data.frame()
for(i in 1:n){
  meet_for <- meet_for |> rbind(
    tibble(
      sally = runif(1, min = 0, max = 60),
      joan = runif(1, min = 0, max = 60),
      result = ifelse(
    abs(sally - joan) <= 10, "They meet", "They do not"
  )))
}
toc()
3.917 sec elapsed

Results

The three approaches give equivalent results; the values differ only because of the randomness in the simulation.

meet_vect |> 
  group_by(result) |> 
  summarize(number = n()) |> 
  mutate(proportion = number / sum(number))
# A tibble: 2 × 3
  result      number proportion
  <chr>        <int>      <dbl>
1 They do not   6966      0.697
2 They meet     3034      0.303
meet_map |> 
  group_by(result) |> 
  summarize(number = n()) |> 
  mutate(proportion = number / sum(number))
# A tibble: 2 × 3
  result      number proportion
  <chr>        <int>      <dbl>
1 They do not   7087      0.709
2 They meet     2913      0.291
meet_for |> 
  group_by(result) |> 
  summarize(number = n()) |> 
  mutate(proportion = number / sum(number))
# A tibble: 2 × 3
  result      number proportion
  <chr>        <int>      <dbl>
1 They do not   6942      0.694
2 They meet     3058      0.306

Visualizing the meet up

meet_map |> 
  ggplot(aes(x = joan, y = sally, color = result)) +
  geom_point(alpha = 0.3) + 
  geom_abline(intercept = 10, slope = 1) + 
  geom_abline(intercept = -10, slope = 1) + 
  scale_color_brewer(palette = "Paired")
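This problem can also be solved exactly from the geometry in the plot: on the 60×60 square, the two “they do not meet” triangles each have legs of length 50, which gives a quick check of the simulated proportions:

```r
# exact probability they meet: 1 minus the area fraction of the two
# miss triangles (legs of length 50) on the 60 x 60 square
p_meet <- 1 - (50 / 60)^2
p_meet   # 11/36, about 0.306
```

All three simulated proportions land near this exact value of \(11/36\).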

Simulation best practices

  • no magic numbers
  • comment your code
  • use informative names
  • set a seed for reproducibility

Bigger simulation example

Thoughts on simulating

  • break down the problem into small steps
  • check to see what works
  • simple is better

Agenda 9/15/25

  1. Simulating to understand complicated models
  2. Simulating to consider statistical methods with fewer assumptions

Modeling

  • Simulation as easier than calculus for estimating probabilities.
  • Simulation to understand complicated models.
  • Simulation to consider statistical methods with fewer assumptions.

Sensitivity of assumptions

  • Simulation as easier than calculus for estimating probabilities.
  • Simulation to understand complicated models.
  • Simulation to consider statistical methods with fewer assumptions.

t-test with Poisson data

The t-test comes with a guarantee: if the null hypothesis is true, we’d expect to reject it \(\alpha \cdot 100\) percent of the time. That is, if \(\alpha = 0.05\), we’d reject true null hypotheses 5% of the time.

The guarantee of 5% (across many tests) happens only when the data are normal (or reasonably normal).

Non-normal data

data.frame(pois_data = rpois(n = 100000, lambda = 1)) |> 
  ggplot(aes(x = pois_data)) +
  geom_histogram()

t-test function

t.test.pval <- function(data1, data2){
  t.test(data1, data2) |> 
    broom::tidy() |> 
    select(p.value)
}

t.test.pval(rpois(n = 10, lambda = 1), rpois(n = 7, lambda = 1))
# A tibble: 1 × 1
  p.value
    <dbl>
1   0.696

t-test 10 times

map(1:10, ~t.test.pval(rpois(n = 10, lambda = 1),
                       rpois(n = 7, lambda = 1))) |> 
          list_rbind()
# A tibble: 10 × 1
   p.value
     <dbl>
 1  0.643 
 2  0.304 
 3  0.784 
 4  0.0874
 5  0.0379
 6  0.361 
 7  0.0215
 8  0.532 
 9  0.319 
10  0.539 

t-test many times

set.seed(4747)
results <- map(1:10000, 
               ~t.test.pval(rpois(n = 8, lambda = 1),
                            rpois(n = 5, lambda = 1))) |> 
  list_rbind() 
results |> 
  ggplot(aes(x = p.value)) + 
  geom_histogram(bins = 30) + 
  labs(x = "p.values")

results |> 
  summarize(type1 = mean(p.value < 0.05))
# A tibble: 1 × 1
   type1
   <dbl>
1 0.0498

Sensitivity of CI to tech conditions

Consider the following set up:

set.seed(474)
beta0 <- -1; beta1 <- 0.5; beta2 <- 1.5
n_obs <- 100; reps <- 1000


x1 <- rep(c(0,1), each=n_obs/2)
x2 <- runif(n_obs, min=-1, max=1)
y <- beta0 + beta1*x1 + beta2*x2 + rnorm(n_obs, mean=0, sd = 1)

Plot of data

Capture the slope parameter?

  • We know the truth (!!!!!)
  • We can create a single CI for the parameter given a single data set
  • Is the parameter in the interval?
  • Should it always be in the interval? How often should it be?
  • Repeat the process to see whether the confidence interval captures the parameter at the claimed rate.

See: Rossman Chance regression applet

Running a linear model

CI <- lm(y~x1+x2) |> tidy(conf.int=TRUE) |> data.frame()
CI
         term   estimate std.error statistic      p.value    conf.low
1 (Intercept) -0.9291632 0.1424025 -6.524909 3.098281e-09 -1.21179266
2          x1  0.3304882 0.2016861  1.638627 1.045307e-01 -0.06980279
3          x2  1.4447884 0.1919184  7.528137 2.662556e-11  1.06388341
   conf.high
1 -0.6465337
2  0.7307792
3  1.8256934
CI |>
  filter(term == "x2") |>
  select(term, estimate, conf.low, conf.high) |>
  mutate(inside = between(beta2, conf.low, conf.high))
  term estimate conf.low conf.high inside
1   x2 1.444788 1.063883  1.825693   TRUE

What happens to the CI of coefficients in repeated samples? (eq var)

eqvar_data <- data.frame(row_id = seq(1, n_obs, 1)) |>
  slice(rep(row_id, each = reps)) |>
  mutate(
    sim_id = rep(1:reps, n_obs),
    x1 = rep(c(0,1), each = n()/2),
    x2 = runif(n(), min = -1, max = 1),
    y = beta0 + beta1*x1 + beta2*x2 + rnorm(n(), mean = 0, sd = 1)
  ) |>
  arrange(sim_id, row_id) |>
  group_by(sim_id) |>
  nest()

eqvar_data
# A tibble: 1,000 × 2
# Groups:   sim_id [1,000]
   sim_id data              
    <int> <list>            
 1      1 <tibble [100 × 4]>
 2      2 <tibble [100 × 4]>
 3      3 <tibble [100 × 4]>
 4      4 <tibble [100 × 4]>
 5      5 <tibble [100 × 4]>
 6      6 <tibble [100 × 4]>
 7      7 <tibble [100 × 4]>
 8      8 <tibble [100 × 4]>
 9      9 <tibble [100 × 4]>
10     10 <tibble [100 × 4]>
# ℹ 990 more rows
beta_coef <- function(df){
  lm(y ~ x1 + x2, data = df) |>
    tidy(conf.int = TRUE) |>
    filter(term == "x2") |>
    select(estimate, conf.low, conf.high, p.value) 
}
eqvar_data |> 
  mutate(b2_vals = map(data, beta_coef)) |>
  select(b2_vals) |> 
  unnest(b2_vals) |>
  summarize(capture = between(beta2, conf.low, conf.high)) |>
  summarize(capture_rate = sum(capture)/reps)
# A tibble: 1 × 1
  capture_rate
         <dbl>
1        0.953

Sensitivity of CI to tech conditions

Consider the following set up (note the difference in variability):

set.seed(470)
beta0 <- -1; beta1 <- 0.5; beta2 <- 1.5
n_obs <- 100; reps <- 1000

x1 <- rep(c(0,1), each=n_obs/2)
x2 <- runif(n_obs, min=-1, max=1)
y <- beta0 + beta1*x1 + beta2*x2 + 
  rnorm(n_obs, mean=0, sd = 1 + x1 + 10*abs(x2))

Plot of data

What happens to the CI of coefficients in repeated samples? (uneq var)

uneqvar_data <- data.frame(row_id = seq(1, n_obs, 1)) |>
  slice(rep(row_id, each = reps)) |>
  mutate(sim_id = rep(1:reps, n_obs), 
         x1 = rep(c(0,1), each = n()/2),
         x2 = runif(n(), min = -1, max = 1),
         y = beta0 + beta1*x1 + beta2*x2 + 
              rnorm(n(), mean = 0, sd = 1 + x1 + 10*abs(x2))  ) |>
  arrange(sim_id, row_id) |>
  group_by(sim_id) |> nest() 
uneqvar_data |> 
  mutate(b2_vals = map(data, beta_coef)) |>
  select(b2_vals) |> unnest(b2_vals) |>
  summarize(capture = between(beta2, conf.low, conf.high)) |>
  summarize(capture_rate = sum(capture)/reps)
# A tibble: 1 × 1
  capture_rate
         <dbl>
1        0.877

Equal variance: type I error?

Here is another way to think about type I error rates: with equal variance, the simulated type I error rate is close to what we’d expect, 5%.

t_test_pval <- function(df){
  t.test(y ~ x1, data = df, var.equal = TRUE) |>
    tidy() |>
    select(estimate, p.value) 
}
set.seed(470)
reps <- 1000
n_obs <- 20
null_data_equal <- 
  data.frame(row_id = seq(1, n_obs, 1)) |>
  slice(rep(row_id, each = reps)) |>
  mutate(
    sim_id = rep(1:reps, n_obs),
    x1 = rep(c("group1", "group2"), each = n()/2),
    y = rnorm(n(), mean = 10, 
               sd = rep(c(1,1), each = n()/2))
  ) |>
  arrange(sim_id, row_id) |>
  group_by(sim_id) |>
  nest()
null_data_equal |> 
  mutate(t_vals = map(data,t_test_pval)) |>
  select(t_vals) |> 
  unnest(t_vals) |>
  ungroup(sim_id) |>
  summarize(type1error_rate = sum(p.value < 0.05)/reps)
# A tibble: 1 × 1
  type1error_rate
            <dbl>
1           0.045

Unequal variance: type I error?

With unequal variance, the type I error rate is higher than the level we set. We set the type I error rate to be 5%, but the simulated rate was 5.7%.

set.seed(470)
reps <- 1000
n_obs <- 20
null_data_unequal <- 
  data.frame(row_id = seq(1, n_obs, 1)) |>
  slice(rep(row_id, each = reps)) |>
  mutate(
    sim_id = rep(1:reps, n_obs),
    x1 = rep(c("group1", "group2"), each = n()/2),
    y = rnorm(n(), mean = 10, 
               sd = rep(c(1,100), each = n()/2))
  ) |>
  arrange(sim_id, row_id) |>
  group_by(sim_id) |>
  nest()
null_data_unequal |> 
  mutate(t_vals = map(data,t_test_pval)) |>
  select(t_vals) |> 
  unnest(t_vals) |>
  ungroup(sim_id) |>
  summarize(type1error_rate = sum(p.value < 0.05)/reps)
# A tibble: 1 × 1
  type1error_rate
            <dbl>
1           0.057

References

Baumer, Ben, Daniel Kaplan, and Nicholas Horton. 2021. Modern Data Science with r. CRC Press. https://mdsr-book.github.io/mdsr2e/.
Mosteller, Frederick. 1987. Fifty Challenging Problems in Probability with Solutions. Dover Publications: Mineola, NY.